One Parameter

In Chapter @ref(sampling), we studied sampling. We started with a “tactile” exercise where we wanted to know the proportion of balls in the urn in Figure @ref(fig:sampling-exercise-1) that are red. While we could have performed an exhaustive count, this would have been a tedious process. So instead, we used a shovel to extract a sample of 50 balls and used the resulting proportion that were red as an estimate. Furthermore, we made sure to mix the urn’s contents before every use of the shovel. Because of the randomness created by the mixing, different uses of the shovel yielded different proportions red and hence different estimates of the proportion of the urn’s balls that are red.

Remember: There is a truth here. There is an urn. It has red and white balls in it. An exact, but unknown, number of the balls are red. An exact, but unknown, number of the balls are white. An exact, but unknown, percentage of the balls are red – defined as the number red divided by the sum of the number red and the number white. Our goal is to estimate that unknown percentage. We want to make statements about the world, even if we can never be certain that those statements are true. We will never have the time or inclination to actually count all the balls. We use the term parameter for things that exist but which are unknown. We use statistics to estimate the true values of parameters.

We then mimicked this physical sampling exercise with an equivalent virtual sampling exercise using the computer. In Subsection @ref(different-shovels), we repeated this sampling procedure 1,000 times, using three different virtual shovels with 25, 50, and 100 slots. We visualized these three sets of 1,000 estimates in Figure @ref(fig:comparing-sampling-distributions-3) and saw that as the sample size increased, the variation in the estimates decreased. We then expanded this for all sample sizes from 1 to 100.

In doing so, we constructed sampling distributions. The motivation for taking a 1,000 repeated samples and visualizing the resulting estimates was to study how these estimates varied from one sample to another; in other words, we wanted to study the effect of sampling variation. We quantified the variation of these estimates using their standard deviation, which has a special name: the standard error. In particular, we saw that as the sample size increased from 1 to 100, the standard error decreased and thus the sampling distributions narrowed. Larger sample sizes led to more precise estimates that varied less around the center.

We then tied these sampling exercises to terminology and mathematical notation related to sampling in Subsection @ref(terminology-and-notation). Our study population was the large urn with \(N\) = 2,400 balls, while the population parameter, the unknown quantity of interest, was the population proportion \(p\) of the urn’s balls that were red. Since performing a census would be expensive in terms of time and energy, we instead extracted a sample of size \(n\) = 50. The point estimate, also known as a sample statistic, used to estimate \(p\) was the sample proportion \(\widehat{p}\) of these 50 sampled balls that were red. Furthermore, since the sample was obtained at random, it can be considered as unbiased and as representative of the population. Thus any results based on the sample could be generalized to the population. Therefore, the proportion of the shovel’s balls that were red was a “good guess” of the proportion of the urn’s balls that are red. In other words, we used the sample to draw inferences about the population.

However, as described in Section @ref(sampling-simulation), both the physical and virtual sampling exercises are not what one would do in real life. This was merely an activity used to study the effects of sampling variation. In a real life situation, we would not take 1,000 samples of size \(n\), but rather take a single representative sample that’s as large as possible. Additionally, we knew that the true proportion of the urn’s balls that were red was 37.5%. In a real-life situation, we will not know what this value is. Because if we did, then why would we take a sample to estimate it?

An example of a realistic sampling situation would be a poll, like the Obama poll you saw in Section @ref(sampling-case-study). Pollsters did not know the true proportion of all young Americans who supported President Obama in 2013, and thus they took a single sample of size \(n\) = 2,089 young Americans to estimate this value.

So how does one quantify the effects of sampling variation when you only have a single sample to work with? You cannot directly study the effects of sampling variation when you only have one sample. One common method to study this is bootstrapping resampling.

What if we would like, not only a single estimate of the unknown population parameter, but also a range of highly plausible values? Going back to the Obama poll article, it stated that the pollsters’ estimate of the proportion of all young Americans who supported President Obama was 41%. But in addition it stated that the poll’s “margin of error was plus or minus 2.1 percentage points.” This “plausible range” was [41% - 2.1%, 41% + 2.1%] = [38.9%, 43.1%]. This range of plausible values is what’s known as a confidence interval, which will be the focus of the later sections of this chapter.

Needed packages

Let’s load all the packages needed for this chapter (this assumes you’ve already installed them). Recall from our discussion in Section @ref(tidyverse-package) that loading the tidyverse package by running library(tidyverse) loads the following commonly used data science packages all at once:

  • ggplot2 for data visualization
  • dplyr for data wrangling
  • tidyr for converting data to tidy format
  • readr for importing spreadsheet data into R
  • As well as the more advanced purrr, tibble, stringr, and forcats packages

If needed, read Section @ref(packages) for information on how to install and load R packages.

library(tidyverse)

Pennies activity

As we did in Chapter @ref(sampling), we’ll begin with a hands-on tactile activity.

What is the average year on US pennies in 2019?

Try to imagine all the pennies being used in the United States in 2019. That’s a lot of pennies! Now say we’re interested in the average year of minting of all these pennies. One way to compute this value would be to gather up all pennies being used in the US, record the year, and compute the average. However, this would be near impossible! So instead, let’s collect a sample of 50 pennies from a local bank in downtown Northampton, Massachusetts, USA as seen in Figure @ref(fig:resampling-exercise-a).

Collecting a sample of 50 US pennies from a local bank.

Collecting a sample of 50 US pennies from a local bank.

Collecting a sample of 50 US pennies from a local bank.

Collecting a sample of 50 US pennies from a local bank.

An image of these 50 pennies can be seen in Figure @ref(fig:resampling-exercise-c). For each of the 50 pennies starting in the top left, progressing row-by-row, and ending in the bottom right, note there is an “ID” identification variable printed in black and the year of minting printed in white.

50 US pennies labelled.

50 US pennies labelled.

The moderndive package contains this data on our 50 sampled pennies in the pennies_sample data frame:

library(moderndive)

pennies_sample
## # A tibble: 50 x 2
##       ID  year
##    <int> <dbl>
##  1     1  2002
##  2     2  1986
##  3     3  2017
##  4     4  1988
##  5     5  2008
##  6     6  1983
##  7     7  2008
##  8     8  1996
##  9     9  2004
## 10    10  2000
## # … with 40 more rows

The pennies_sample data frame has 50 rows corresponding to each penny with two variables. The first variable ID corresponds to the ID labels in Figure @ref(fig:resampling-exercise-c), whereas the second variable year corresponds to the year of minting saved as a numeric variable, also known as a double (dbl).

Based on these 50 sampled pennies, what can we say about all US pennies in 2019? Let’s study some properties of our sample by performing an exploratory data analysis. Let’s first visualize the distribution of the year of these 50 pennies using our data visualization tools from Chapter @ref(viz). Since year is a numerical variable, we use a histogram in Figure @ref(fig:pennies-sample-histogram) to visualize its distribution.

pennies_sample %>%
  ggplot(aes(x = year)) +
  geom_histogram(binwidth = 10, color = "white")
Distribution of year on 50 US pennies.

Distribution of year on 50 US pennies.

Observe a slightly left-skewed distribution, since most pennies fall between 1980 and 2010 with only a few pennies older than 1970. What is the average year for the 50 sampled pennies? Eyeballing the histogram it appears to be around 1990. Let’s now compute this value exactly using our data wrangling tools from Chapter @ref(wrangling).

pennies_sample %>% 
  summarize(mean_year = mean(year))
## # A tibble: 1 x 1
##   mean_year
##       <dbl>
## 1     1995.

Thus, if we’re willing to assume that pennies_sample is a representative sample from all US pennies, a “good guess” of the average year of minting of all US pennies would be 1995.44. In other words, around 1995. This should all start sounding similar to what we did previously in Chapter @ref(sampling)!

In Chapter @ref(sampling), our study population was the urn of \(N\) = 2400 balls. Our population parameter was the population proportion of these balls that were red, denoted by \(p\). In order to estimate \(p\), we extracted a sample of 50 balls using the shovel. We then computed the relevant point estimate: the sample proportion of these 50 balls that were red, denoted mathematically by \(\widehat{p}\).

Here our population is \(N\) = whatever the number of pennies are being used in the US, a value which we don’t know and probably never will. The population parameter of interest is now the population mean year of all these pennies, a value denoted mathematically by the Greek letter \(\mu\) (pronounced “mu”). In order to estimate \(\mu\), we went to the bank and obtained a sample of 50 pennies and computed the relevant point estimate: the sample mean year of these 50 pennies, denoted mathematically by \(\overline{x}\) (pronounced “x-bar”). An alternative and more intuitive notation for the sample mean is \(\widehat{\mu}\). However, this is unfortunately not as commonly used, so in this book we’ll stick with convention and always denote the sample mean as \(\overline{x}\).

We summarize the correspondence between the sampling urn exercise in Chapter @ref(sampling) and our pennies exercise in Table @ref(tab:table-ch8-b).

Scenarios of sampling for inference
Scenario Population parameter Notation Point estimate Symbol(s)
1 Population proportion \(p\) Sample proportion \(\widehat{p}\)
2 Population mean \(\mu\) Sample mean \(\overline{x}\) or \(\widehat{\mu}\)

Going back to our 50 sampled pennies in Figure @ref(fig:resampling-exercise-c), the point estimate of interest is the sample mean \(\overline{x}\) of 1995.44. This quantity is an estimate of the population mean year of all US pennies \(\mu\).

Recall that we also saw in Chapter @ref(sampling) that such estimates are prone to sampling variation. For example, in this particular sample in Figure @ref(fig:resampling-exercise-c), we observed three pennies with the year 1999. If we sampled another 50 pennies, would we observe exactly three pennies with the year 1999 again? More than likely not. We might observe none, one, two, or maybe even all 50! The same can be said for the other 26 unique years that are represented in our sample of 50 pennies.

To study the effects of sampling variation in Chapter @ref(sampling), we took many samples, something we could easily do with our shovel. In our case with pennies, however, how would we obtain another sample? By going to the bank and getting another roll of 50 pennies.

Say we’re feeling lazy, however, and don’t want to go back to the bank. How can we study the effects of sampling variation using our single sample? We will do so using a technique known as bootstrap resampling with replacement, which we now illustrate.

Resampling once

Step 1: Let’s print out identically sized slips of paper representing our 50 pennies as seen in Figure @ref(fig:tactile-resampling-1).

Step 1: 50 slips of paper representing 50 US pennies.

Step 1: 50 slips of paper representing 50 US pennies.

Step 2: Put the 50 slips of paper into a hat or tuque as seen in Figure @ref(fig:tactile-resampling-2).

Step 2: Putting 50 slips of paper in a hat.

Step 2: Putting 50 slips of paper in a hat.

Step 3: Mix the hat’s contents and draw one slip of paper at random as seen in Figure @ref(fig:tactile-resampling-3). Record the year.

Step 3: Drawing one slip of paper at random.

Step 3: Drawing one slip of paper at random.

Step 4: Put the slip of paper back in the hat! In other words, replace it as seen in Figure @ref(fig:tactile-resampling-4).

Step 4: Replacing slip of paper.

Step 4: Replacing slip of paper.

Step 5: Repeat Steps 3 and 4 a total of 49 more times, resulting in 50 recorded years.

What we just performed was a resampling of the original sample of 50 pennies. We are not sampling 50 pennies from the population of all US pennies as we did in our trip to the bank. Instead, we are mimicking this act by resampling 50 pennies from our original sample of 50 pennies.

Now ask yourselves, why did we replace our resampled slip of paper back into the hat in Step 4? Because if we left the slip of paper out of the hat each time we performed Step 4, we would end up with the same 50 original pennies! In other words, replacing the slips of paper induces sampling variation.

Being more precise with our terminology, we just performed a resampling with replacement from the original sample of 50 pennies. Had we left the slip of paper out of the hat each time we performed Step 4, this would be resampling without replacement.

Let’s study our 50 resampled pennies via an exploratory data analysis. First, let’s load the data into R by manually creating a data frame pennies_resample of our 50 resampled values. We’ll do this using the tibble() command from the dplyr package. Note that the 50 values you resample will almost certainly not be the same as ours given the inherent randomness.

pennies_resample <- tibble(
  year = c(1976, 1962, 1976, 1983, 2017, 2015, 2015, 1962, 2016, 1976, 
           2006, 1997, 1988, 2015, 2015, 1988, 2016, 1978, 1979, 1997, 
           1974, 2013, 1978, 2015, 2008, 1982, 1986, 1979, 1981, 2004, 
           2000, 1995, 1999, 2006, 1979, 2015, 1979, 1998, 1981, 2015, 
           2000, 1999, 1988, 2017, 1992, 1997, 1990, 1988, 2006, 2000)
)

The 50 values of year in pennies_resample represent a resample of size 50 from the original sample of 50 pennies. We display the 50 resampled pennies in Figure @ref(fig:resampling-exercise-d).

50 resampled US pennies labelled.

50 resampled US pennies labelled.

Let’s compare the distribution of the numerical variable year of our 50 resampled pennies with the distribution of the numerical variable year of our original sample of 50 pennies in Figure @ref(fig:origandresample).

pennies_resample %>%
  ggplot(aes(x = year)) +
  geom_histogram(binwidth = 10, color = "white") +
  labs(title = "Resample of 50 pennies")

pennies_sample %>%
  ggplot(aes(x = year)) +
  geom_histogram(binwidth = 10, color = "white") +
  labs(title = "Original sample of 50 pennies")

(ref:compare-plots) Comparing year in the resampled pennies_resample with the original sample pennies_sample.

## Warning: Removed 2 rows containing missing values (geom_bar).

## Warning: Removed 2 rows containing missing values (geom_bar).
(ref:compare-plots)

(ref:compare-plots)

Observe in Figure @ref(fig:origandresample) that while the general shapes of both distributions of year are roughly similar, they are not identical.

Recall from the previous section that the sample mean of the original sample of 50 pennies from the bank was 1995.44. What about for our resample? Any guesses? Let’s have dplyr help us out as before:

pennies_resample %>% 
  summarize(mean_year = mean(year))
## # A tibble: 1 x 1
##   mean_year
##       <dbl>
## 1     1997.

We obtained a different mean year of 1996.94. This variation is induced by the resampling with replacement we performed earlier.

What if we repeated this resampling exercise many times? Would we obtain the same mean year each time? In other words, would our guess at the mean year of all pennies in the US in 2019 be exactly 1996.94 every time? Just as we did in Chapter @ref(sampling), let’s perform this resampling activity with the help of some of our friends: 35 friends in total.

Resampling 35 times

Each of our 35 friends will repeat the same five steps:

  1. Start with 50 identically sized slips of paper representing the 50 pennies.
  2. Put the 50 small pieces of paper into a hat or beanie cap.
  3. Mix the hat’s contents and draw one slip of paper at random. Record the year in a spreadsheet.
  4. Replace the slip of paper back in the hat!
  5. Repeat Steps 3 and 4 a total of 49 more times, resulting in 50 recorded years.

Since we had 35 of our friends perform this task, we will end up with \(35 \cdot 50 = 1750\) values. These values are recorded in a shared spreadsheet with 50 rows (plus a header row) and 35 columns. We display a snapshot of the first 10 rows and five columns of this shared spreadsheet in Figure @ref(fig:tactile-resampling-5).

Snapshot of shared spreadsheet of resampled pennies.

Snapshot of shared spreadsheet of resampled pennies.

These 35 \(\cdot\) 50 = 1750 values are saved in pennies_resamples, a “tidy” data frame included in the moderndive package. We saw what it means for a data frame to be “tidy” in Subsection @ref(tidy-definition).

pennies_resamples
## # A tibble: 1,750 x 3
## # Groups:   name [35]
##    replicate name     year
##        <int> <chr>   <dbl>
##  1         1 Arianna  1988
##  2         1 Arianna  2002
##  3         1 Arianna  2015
##  4         1 Arianna  1998
##  5         1 Arianna  1979
##  6         1 Arianna  1971
##  7         1 Arianna  1971
##  8         1 Arianna  2015
##  9         1 Arianna  1988
## 10         1 Arianna  1979
## # … with 1,740 more rows

What did each of our 35 friends obtain as the mean year? Once again, dplyr to the rescue! After grouping the rows by name, we summarize each group of 50 rows by their mean year:

resampled_means <- pennies_resamples %>% 
  group_by(name) %>% 
  summarize(mean_year = mean(year))
resampled_means
## # A tibble: 35 x 2
##    name      mean_year
##    <chr>         <dbl>
##  1 Arianna       1992.
##  2 Artemis       1996.
##  3 Bea           1996.
##  4 Camryn        1997.
##  5 Cassandra     1991.
##  6 Cindy         1995.
##  7 Claire        1996.
##  8 Dahlia        1998.
##  9 Dan           1994.
## 10 Eindra        1994.
## # … with 25 more rows

Observe that resampled_means has 35 rows corresponding to the 35 means based on the 35 resamples. Furthermore, observe the variation in the 35 values in the variable mean_year. Let’s visualize this variation using a histogram in Figure @ref(fig:tactile-resampling-6). Recall that adding the argument boundary = 1990 to the geom_histogram() sets the binning structure so that one of the bin boundaries is at 1990 exactly.

ggplot(resampled_means, aes(x = mean_year)) +
  geom_histogram(binwidth = 1, color = "white", boundary = 1990) +
  labs(x = "Sampled mean year")
Distribution of 35 sample means from 35 resamples.

Distribution of 35 sample means from 35 resamples.

Observe in Figure @ref(fig:tactile-resampling-6) that the distribution looks roughly normal and that we rarely observe sample mean years less than 1992 or greater than 2000. Also observe how the distribution is roughly centered at 1995, which is close to the sample mean of 1995.44 of the original sample of 50 pennies from the bank.

What did we just do?

What we just demonstrated in this activity is the statistical procedure known as bootstrap resampling with replacement. We used resampling to mimic the sampling variation we studied in Chapter @ref(sampling) on sampling. However, in this case, we did so using only a single sample from the population.

In fact, the histogram of sample means from 35 resamples in Figure @ref(fig:tactile-resampling-6) is called the bootstrap distribution. It is an approximation to the sampling distribution of the sample mean, in the sense that both distributions will have a similar shape and similar spread. In fact in the upcoming Section @ref(ci-conclusion), we’ll show you that this is the case. Using this bootstrap distribution, we can study the effect of sampling variation on our estimates. In particular, we’ll study the typical “error” of our estimates, known as the standard error.

Starting in Subsection @ref(resampling-simulation), we’ll mimic our tactile resampling activity virtually on the computer, allowing us to quickly perform the resampling many more than 35 times. In Section @ref(ci-build-up) we’ll define the statistical concept of a confidence interval, which builds off the concept of bootstrap distributions.

As we did in Chapter @ref(sampling), we’ll tie all these ideas together with a real-life case study in Section @ref(case-study-two-prop-ci). This time we’ll look at data from an experiment about yawning from the US television show Mythbusters.

Let’s now mimic our tactile resampling activity virtually with a computer.

Virtually resampling once

First, let’s perform the virtual analog of resampling once. Recall that the pennies_sample data frame included in the moderndive package contains the years of our original sample of 50 pennies from the bank. Furthermore, recall in Chapter @ref(sampling) on sampling that we used the rep_sample_n() function in the infer package as a virtual shovel to sample balls from our virtual urn of 2400 balls as follows:

library(infer)

virtual_shovel <- urn %>% 
  rep_sample_n(size = 50)

Let’s modify this code to perform the resampling with replacement of the 50 slips of paper representing our original sample of 50 pennies:

virtual_resample <- pennies_sample %>% 
  rep_sample_n(size = 50, replace = TRUE)

Observe how we explicitly set the replace argument to TRUE in order to tell rep_sample_n() that we would like to sample pennies with replacement. Had we not set replace = TRUE, the function would’ve assumed the default value of FALSE and hence done resampling without replacement. Additionally, since we didn’t specify the number of replicates via the reps argument, the function assumes the default of one replicate reps = 1. Lastly, observe also that the size argument is set to match the original sample size of 50 pennies.

Let’s look at only the first 10 out of 50 rows of virtual_resample:

virtual_resample
## # A tibble: 50 x 3
## # Groups:   replicate [1]
##    replicate    ID  year
##        <int> <int> <dbl>
##  1         1    39  2015
##  2         1    37  1962
##  3         1    13  2015
##  4         1    36  2015
##  5         1     4  1988
##  6         1     1  2002
##  7         1    12  1995
##  8         1    27  1993
##  9         1     5  2008
## 10         1    10  2000
## # … with 40 more rows

The replicate variable only takes on the value of 1 corresponding to us only having reps = 1, the ID variable indicates which of the 50 pennies from pennies_sample was resampled, and year denotes the year of minting. Let’s now compute the mean year in our virtual resample of size 50 using data wrangling functions included in the dplyr package:

virtual_resample %>% 
  summarize(resample_mean = mean(year))
## # A tibble: 1 x 2
##   replicate resample_mean
##       <int>         <dbl>
## 1         1         1998.

As we saw when we did our tactile resampling exercise, the resulting mean year is different than the mean year of our 50 originally sampled pennies of 1995.44.

Virtually resampling 35 times

Let’s now perform the virtual analog of our 35 friends’ resampling. Using these results, we’ll be able to study the variability in the sample means from 35 resamples of size 50. Let’s first add a reps = 35 argument to rep_sample_n() to indicate we would like 35 replicates. Thus, we want to repeat the resampling with the replacement of 50 pennies 35 times.

virtual_resamples <- pennies_sample %>% 
  rep_sample_n(size = 50, replace = TRUE, reps = 35)
virtual_resamples
## # A tibble: 1,750 x 3
## # Groups:   replicate [35]
##    replicate    ID  year
##        <int> <int> <dbl>
##  1         1    33  1979
##  2         1    36  2015
##  3         1    46  2017
##  4         1     7  2008
##  5         1    12  1995
##  6         1    33  1979
##  7         1    20  1971
##  8         1    10  2000
##  9         1    28  2006
## 10         1    18  1996
## # … with 1,740 more rows

The resulting virtual_resamples data frame has 35 \(\cdot\) 50 = 1750 rows corresponding to 35 resamples of 50 pennies. Let’s now compute the resulting 35 sample means using the same dplyr code as we did in the previous section, but this time adding a group_by(replicate):

virtual_resampled_means <- virtual_resamples %>% 
  group_by(replicate) %>% 
  summarize(mean_year = mean(year))
virtual_resampled_means
## # A tibble: 35 x 2
##    replicate mean_year
##        <int>     <dbl>
##  1         1     1996.
##  2         2     1995.
##  3         3     1990.
##  4         4     1996.
##  5         5     1997.
##  6         6     1994.
##  7         7     1998.
##  8         8     1993.
##  9         9     1999.
## 10        10     1993.
## # … with 25 more rows

Observe that virtual_resampled_means has 35 rows, corresponding to the 35 resampled means. Furthermore, observe that the values of mean_year vary. Let’s visualize this variation using a histogram in Figure @ref(fig:tactile-resampling-7).

ggplot(virtual_resampled_means, aes(x = mean_year)) +
  geom_histogram(binwidth = 1, color = "white", boundary = 1990) +
  labs(x = "Resample mean year")
Distribution of 35 sample means from 35 resamples.

Distribution of 35 sample means from 35 resamples.

Let’s compare our virtually constructed bootstrap distribution with the one our 35 friends constructed via our tactile resampling exercise in Figure @ref(fig:orig-and-resample-means). Observe how they are somewhat similar, but not identical.

Comparing distributions of means from resamples.

Comparing distributions of means from resamples.

Recall that in the “resampling with replacement” scenario we are illustrating here, both of these histograms have a special name: the bootstrap distribution of the sample mean. Furthermore, recall they are an approximation to the sampling distribution of the sample mean, a concept you saw in Chapter @ref(sampling) on sampling. These distributions allow us to study the effect of sampling variation on our estimates of the true population mean, in this case the true mean year for all US pennies. However, unlike in Chapter @ref(sampling) where we took multiple samples (something one would never do in practice), bootstrap distributions are constructed by taking multiple resamples from a single sample: in this case, the 50 original pennies from the bank.

Virtually resampling 1,000 times

Remember that one of the goals of resampling with replacement is to construct the bootstrap distribution, which is an approximation of the sampling distribution. However, the bootstrap distribution in Figure @ref(fig:tactile-resampling-7) is based only on 35 resamples and hence looks a little coarse. Let’s increase the number of resamples to 1,000, so that we can hopefully better see the shape and the variability between different resamples.

# Repeat resampling 1,000 times

virtual_resamples <- pennies_sample %>% 
  rep_sample_n(size = 50, replace = TRUE, reps = 1000)

# Compute 1,000 sample means

virtual_resampled_means <- virtual_resamples %>% 
  group_by(replicate) %>% 
  summarize(mean_year = mean(year))

However, in the interest of brevity, going forward let’s combine these two operations into a single chain of pipe (%>%) operators:

virtual_resampled_means <- pennies_sample %>% 
  rep_sample_n(size = 50, replace = TRUE, reps = 1000) %>% 
  group_by(replicate) %>% 
  summarize(mean_year = mean(year))
virtual_resampled_means
## # A tibble: 1,000 x 2
##    replicate mean_year
##        <int>     <dbl>
##  1         1     1993.
##  2         2     1996.
##  3         3     1993.
##  4         4     1992.
##  5         5     1995.
##  6         6     1999.
##  7         7     1995.
##  8         8     1992.
##  9         9     1992.
## 10        10     1998.
## # … with 990 more rows

In Figure @ref(fig:one-thousand-sample-means) let’s visualize the bootstrap distribution of these 1,000 means based on 1,000 virtual resamples:

virtual_resampled_means %>%
  ggplot(aes(x = mean_year)) +
  geom_histogram(binwidth = 1, color = "white", boundary = 1990) +
  labs(x = "sample mean")
Bootstrap resampling distribution based on 1000 resamples.

Bootstrap resampling distribution based on 1000 resamples.

Note here that the bell shape is starting to become much more apparent. We now have a general sense for the range of values that the sample mean may take on. But where is this histogram centered? Let’s compute the mean of the 1,000 resample means:

virtual_resampled_means %>% 
  summarize(mean_of_means = mean(mean_year))
## # A tibble: 1 x 1
##   mean_of_means
##           <dbl>
## 1         1996.

The mean of these 1,000 means is 1995.51, which is quite close to the mean of our original sample of 50 pennies of 1995.44. This is the case since each of the 1,000 resamples is based on the original sample of 50 pennies.

Congratulations! You’ve just constructed your first bootstrap distribution! In the next section, you’ll see how to use this bootstrap distribution to construct confidence intervals.

Measuring uncertainty with confidence intervals

Let’s start this section with an analogy involving fishing. Say you are trying to catch a fish. On the one hand, you could use a spear, while on the other you could use a net. Using the net will probably allow you to catch more fish!

Now think back to our pennies exercise where you are trying to estimate the true population mean year \(\mu\) of all US pennies. Think of the value of \(\mu\) as a fish.

On the one hand, we could use the appropriate point estimate/sample statistic to estimate \(\mu\), which we saw in Table @ref(tab:table-ch8-b) is the sample mean \(\overline{x}\). Based on our sample of 50 pennies from the bank, the sample mean was 1995.44. Think of using this value as “fishing with a spear.”

What would “fishing with a net” correspond to? Look at the bootstrap distribution in Figure @ref(fig:one-thousand-sample-means) once more. Between which two years would you say that “most” sample means lie? While this question is somewhat subjective, saying that most sample means lie between 1992 and 2000 would not be unreasonable. Think of this interval as the “net.”

What we’ve just illustrated is the concept of a confidence interval, which we’ll abbreviate with “CI” throughout this book. As opposed to a point estimate/sample statistic that estimates the value of an unknown population parameter with a single value, a confidence interval gives what can be interpreted as a range of plausible values. Going back to our analogy, point estimates/sample statistics can be thought of as spears, whereas confidence intervals can be thought of as nets.

Analogy of difference between point estimates and confidence intervals.

Analogy of difference between point estimates and confidence intervals.

Our proposed interval of 1992 to 2000 was constructed by eye and was thus somewhat subjective. We now introduce two methods for constructing such intervals in a more exact fashion: the percentile method and the standard error method.

Both methods for confidence interval construction share some commonalities. First, they are both constructed from a bootstrap distribution, as you constructed in Subsection @ref(bootstrap-1000-replicates) and visualized in Figure @ref(fig:one-thousand-sample-means).

Second, they both require you to specify the confidence level. Commonly used confidence levels include 90%, 95%, and 99%. All other things being equal, higher confidence levels correspond to wider confidence intervals, and lower confidence levels correspond to narrower confidence intervals. In this book, we’ll be mostly using 95% and hence constructing “95% confidence intervals for \(\mu\)” for our pennies activity.

Percentile method

One method to construct a confidence interval is to use the middle 95% of values of the bootstrap distribution. We can do this by computing the 2.5th and 97.5th percentiles, which are 1991.2 and 1999.741, respectively. This is known as the percentile method for constructing confidence intervals. You can get these values using the quantile() function on the mean_year column of virtual_resampled_means:

virtual_resampled_means %>%
  pull(mean_year) %>%
  quantile(c(0.025, 0.975))
##     2.5%    97.5% 
## 1991.200 1999.741

Let’s mark these percentiles on the bootstrap distribution with vertical lines in Figure @ref(fig:percentile-method). About 95% of the mean_year variable values in virtual_resampled_means fall between 1991.2 and 1999.741, with 2.5% to the left of the leftmost line and 2.5% to the right of the rightmost line.

(ref:perc-method) Percentile method 95% confidence interval. Interval endpoints marked by vertical lines.

(ref:perc-method)

(ref:perc-method)

Standard error method

If a numerical variable follows a normal distribution, or, in other words, the histogram of this variable is bell-shaped, then roughly 95% of values fall between \(\pm\) 1.96 standard deviations of the mean. Given that our bootstrap distribution based on 1,000 resamples with replacement in Figure @ref(fig:one-thousand-sample-means) is normally shaped, let’s use this fact about normal distributions to construct a confidence interval in a different way.

First, recall the bootstrap distribution has a mean equal to 1995.51. This value almost coincides exactly with the value of the sample mean \(\overline{x}\) of our original 50 pennies of 1995.44. Second, let’s compute the standard deviation of the bootstrap distribution using the values of mean_year in the virtual_resampled_means data frame:

virtual_resampled_means %>% 
  summarize(SE = sd(mean_year))
## # A tibble: 1 x 1
##      SE
##   <dbl>
## 1  2.18

What is this value? Recall that the bootstrap distribution is an approximation to the sampling distribution. Recall also that the standard deviation of a sampling distribution has a special name: the standard error. Putting these two facts together, we can say that 2.17868 is an approximation of the standard error of \(\overline{x}\).

Thus, using our 95% rule of thumb about normal distributions, we can use the following formula to determine the lower and upper endpoints of a 95% confidence interval for \(\mu\):

\[ \begin{aligned} \overline{x} \pm 1.96 \cdot SE &= (\overline{x} - 1.96 \cdot SE, \overline{x} + 1.96 \cdot SE)\\ &= (1995.44 - 1.96 \cdot 2.18, 1995.44 + 1.96 \cdot 2.18)\\ &= (1991.15, 1999.73) \end{aligned} \]

Let’s now add the SE method confidence interval with dashed lines in Figure @ref(fig:percentile-and-se-method).

(ref:both-methods) Comparing two 95% confidence interval methods.

## Warning: attributes are not identical across measure variables;
## they will be dropped
(ref:both-methods)

(ref:both-methods)

We see that both methods produce nearly identical 95% confidence intervals for \(\mu\) with the percentile method yielding \((1991.2, 1999.74)\) while the standard error method produces \((1991.17, 1999.71)\). Thus, the standard error method can be a good quick way to construct confidence intervals when you already have an estimate of the standard error and don’t want to go through the steps to obtain a confidence interval by the percentile method. It is particularly handy since \(1.96 \approx 2\), and thus it is easy to calculate in one’s head. However, recall that we can only use the standard error rule when the bootstrap distribution is roughly normally shaped. If you have an unusually shaped distribution, it is better to use the percentile method.

Interpreting confidence intervals

Now that we’ve shown you how to construct confidence intervals using a sample drawn from a population, let’s now focus on how to interpret them.

The traditional approach is referred to as either “Frequentist” or “Classical.” In this interpretation, 95% of the time that you perform this exercise, the intervals constructed will contain the true but unknown population parameter. You are making a claim about how well this approach — bootstrap resampling — works if you do it many, many times. You do not know anything about how well it worked this time.

In this book, we use a “Bayesian” interpretation. There is a true, but unknown, average year of minting for all the pennies in the world. In theory, we could find all those pennies and then calculate this number. The Bayesian interpretation of a 95% confidence interval is that we are 95% certain that the true value is within the CI limits. It would be a fair wager to bet, at 19:1 odds, that the true value is outside those limits because there is a 5% chance that it is.

Note that these are just interpretations. The reality of what we did is the same in both cases. The computer code is the same. This Bayesian interpretation of the bootstrap confidence interval is often called a credible interval in the academic literature. You may find yourself in a statistics class that uses the phrase “confidence interval” to refer only to the frequentist concept.

Does our percentile-based confidence interval of (1991.2, 1999.74) “capture” the true mean year \(\mu\) of all US pennies? Alas, we’ll never know, because we don’t know what the true value of \(\mu\) is. After all, we’re sampling to estimate it!

However, we can answer a different question. Given that we’ve observed the 50 pennies that we did, what’s the chance that the true value falls within (1991.2, 1999.74)? If the answer is 95%, then we can call this a 95% interval.

This is a statement about probability. We thus can think of the mean year of the pennies as a random variable that has its own probability distribution. Of course, we don’t know exactly what that distribution is. The bootstrap resampling procedure, however, generates a distribution of values for the mean. Since the bootstrap distribution is a good approximation of the sampling distribution we learned about in the last chapter (see Subsection @ref(bootstrap-vs-sampling)), this is a reasonable thing to do. Then, if we take the bootstrap distribution as our “best guess” of the probability distribution of the variable we are interested in, we can guess that the variable has a 95% chance of lying between the 2.5th and 97.5th percentiles of the bootstrap distribution.

Instead of looking at confidence intervals, a common alternative approach is to conduct hypothesis tests, where one hypothesis is called the “null hypothesis” (often that a value, such as a mean, is equal to zero) and the result of the test is either rejecting the null hypothesis (so you’d conclude that the mean is not zero) or failing to reject the null hypothesis. The decision whether to reject the null hypothesis is generally made with reference to a p-value, a measure of how likely one would be, if the null hypothesis were true, to observe results at least as extreme as the results actually observed. A p-value cutoff, often 0.05, is employed: if the p-value is lower, the null hypothesis is rejected, otherwise the hypothesis is not rejected.

We think this is a bad way to make decisions. Two very similar datasets could produce p-values of \(p = 0.04\) and \(p = 0.06\) for a quantity of interest. If you would make one decision in the former case and a totally different decision in the latter case, then there’s something wrong with your decision-making process! Rather, we think it is more sensible to look at the data, construct models to summarize important features of the data, and make decisions based on those models that take into account the uncertainty in the models’ estimates.

Width of confidence intervals

Now that we know how to interpret confidence intervals, let’s go over some factors that determine their width.

Impact of confidence level

One factor that determines confidence interval widths is the pre-specified confidence level. The quantification of the confidence level should match what many expect of the word “confident.” In order to be more confident in our best guess of a range of values, we need to widen the range of values.

Imagine we want to forecast the high temperature in Seoul, South Korea on August 15th. Given Seoul’s temperate climate with four distinct seasons, we could say somewhat confidently that the high temperature would be between 50°F - 95°F (10°C - 35°C). However, if we wanted a temperature range we were absolutely confident about, we would need to widen it.

We need this wider range to allow for the possibility of anomalous weather, like a freak cold spell or an extreme heat wave. So a range of temperatures we could be near certain about would be between 32°F - 110°F (0°C - 43°C). On the other hand, if we could tolerate being a little less confident, we could narrow this range to between 70°F - 85°F (21°C - 30°C).

Let’s revisit our sampling urn from Chapter @ref(sampling). Let’s compare \(10 \cdot 3 = 30\) confidence intervals for \(p\) based on three different confidence levels: 80%, 95%, and 99%.

Specifically, we’ll first take 30 different random samples of size \(n = 50\) balls from the urn. Then we’ll construct ten percentile-based confidence intervals using each of the three different confidence levels.

Finally, we’ll compare the widths of these intervals.

We’ll start by walking through the process for the 80% confidence intervals. We begin by taking 10 random samples from the urn of size \(n = 50\):

set.seed(9)

# Compute 80% confidence intervals

# Get 10 random samples

perc_cis_80 <- urn %>%
  rep_sample_n(size = 50, reps = 10)
perc_cis_80
## # A tibble: 500 x 3
## # Groups:   replicate [10]
##    replicate ball_ID color
##        <int>   <int> <chr>
##  1         1    2235 red  
##  2         1    1589 white
##  3         1    1286 white
##  4         1    1851 white
##  5         1     408 red  
##  6         1     595 red  
##  7         1    1027 red  
##  8         1     556 red  
##  9         1     549 white
## 10         1      42 red  
## # … with 490 more rows

Next, we’ll construct a confidence interval for each one. This will require nesting the data by sample and then mapping to construct each confidence interval. Let’s grab 1,000 bootstrap replicates for each sample:

perc_cis_80 <- perc_cis_80 %>%
  
  # For each one, construct a CI
  
  group_by(replicate) %>%
  nest() %>%
  mutate(boots = map(data, ~ rep_sample_n(., size = 50, replace = TRUE, reps = 1000)))

perc_cis_80
## # A tibble: 10 x 3
## # Groups:   replicate [10]
##    replicate data              boots                
##        <int> <list>            <list>               
##  1         1 <tibble [50 × 2]> <tibble [50,000 × 3]>
##  2         2 <tibble [50 × 2]> <tibble [50,000 × 3]>
##  3         3 <tibble [50 × 2]> <tibble [50,000 × 3]>
##  4         4 <tibble [50 × 2]> <tibble [50,000 × 3]>
##  5         5 <tibble [50 × 2]> <tibble [50,000 × 3]>
##  6         6 <tibble [50 × 2]> <tibble [50,000 × 3]>
##  7         7 <tibble [50 × 2]> <tibble [50,000 × 3]>
##  8         8 <tibble [50 × 2]> <tibble [50,000 × 3]>
##  9         9 <tibble [50 × 2]> <tibble [50,000 × 3]>
## 10        10 <tibble [50 × 2]> <tibble [50,000 × 3]>

Now that we have a list column called boots for each of our original 10 samples, we can now get 1,000 estimates per sample of prop_red from the bootstrap:

perc_cis_80 <- perc_cis_80 %>%
  mutate(prop_red_results = map(boots, ~ group_by(., replicate) %>%
                                            summarize(prop_red = mean(color == "red"))),
         prop_red = map(prop_red_results, ~ pull(., prop_red)))
perc_cis_80
## # A tibble: 10 x 5
## # Groups:   replicate [10]
##    replicate data            boots              prop_red_results    prop_red    
##        <int> <list>          <list>             <list>              <list>      
##  1         1 <tibble [50 × … <tibble [50,000 ×… <tibble [1,000 × 2… <dbl [1,000…
##  2         2 <tibble [50 × … <tibble [50,000 ×… <tibble [1,000 × 2… <dbl [1,000…
##  3         3 <tibble [50 × … <tibble [50,000 ×… <tibble [1,000 × 2… <dbl [1,000…
##  4         4 <tibble [50 × … <tibble [50,000 ×… <tibble [1,000 × 2… <dbl [1,000…
##  5         5 <tibble [50 × … <tibble [50,000 ×… <tibble [1,000 × 2… <dbl [1,000…
##  6         6 <tibble [50 × … <tibble [50,000 ×… <tibble [1,000 × 2… <dbl [1,000…
##  7         7 <tibble [50 × … <tibble [50,000 ×… <tibble [1,000 × 2… <dbl [1,000…
##  8         8 <tibble [50 × … <tibble [50,000 ×… <tibble [1,000 × 2… <dbl [1,000…
##  9         9 <tibble [50 × … <tibble [50,000 ×… <tibble [1,000 × 2… <dbl [1,000…
## 10        10 <tibble [50 × … <tibble [50,000 ×… <tibble [1,000 × 2… <dbl [1,000…

Finally, we will use the list column prop_red to get the lower and upper bounds of our confidence intervals and note the level of confidence we used:

perc_cis_80 <- perc_cis_80 %>%
  mutate(lower = map_dbl(prop_red, ~ quantile(., probs = 0.1)),
         upper = map_dbl(prop_red, ~ quantile(., probs = 0.9)),
         confidence_level = "80%") %>%
  select(replicate, lower, upper, confidence_level)
perc_cis_80
## # A tibble: 10 x 4
## # Groups:   replicate [10]
##    replicate lower upper confidence_level
##        <int> <dbl> <dbl> <chr>           
##  1         1  0.36  0.54 80%             
##  2         2  0.3   0.48 80%             
##  3         3  0.22  0.38 80%             
##  4         4  0.26  0.42 80%             
##  5         5  0.32  0.5  80%             
##  6         6  0.3   0.5  80%             
##  7         7  0.32  0.5  80%             
##  8         8  0.32  0.48 80%             
##  9         9  0.32  0.48 80%             
## 10        10  0.26  0.42 80%

Now we can do the same for the 95% and 99% confidence intervals:

# Compute 95% confidence intervals

# Get 10 random samples

perc_cis_95 <- urn %>%
  rep_sample_n(size = 50, reps = 10) %>%
  
  # For each one, construct a CI
  
  group_by(replicate) %>%
  nest() %>%
  mutate(boots = map(data, ~ rep_sample_n(., size = 50, replace = TRUE, reps = 1000)),
         prop_red_results = map(boots, ~ group_by(., replicate) %>%
                                  summarize(prop_red = mean(color == "red"))),
         prop_red = map(prop_red_results, ~ pull(., prop_red)),
         lower = map_dbl(prop_red, ~ quantile(., probs = 0.025)),
         upper = map_dbl(prop_red, ~ quantile(., probs = 0.975)),
         confidence_level = "95%") %>%
  select(replicate, lower, upper, confidence_level)


# Compute 99% confidence intervals

# Get 10 random samples

perc_cis_99 <- urn %>%
  rep_sample_n(size = 50, reps = 10) %>%
  
  # For each one, construct a CI
  
  group_by(replicate) %>%
  nest() %>%
  mutate(boots = map(data, ~ rep_sample_n(., size = 50, replace = TRUE, reps = 1000)),
         prop_red_results = map(boots, ~ group_by(., replicate) %>%
                                  summarize(prop_red = mean(color == "red"))),
         prop_red = map(prop_red_results, ~ pull(., prop_red)),
         lower = map_dbl(prop_red, ~ quantile(., probs = 0.005)),
         upper = map_dbl(prop_red, ~ quantile(., probs = 0.995)),
         confidence_level = "99%") %>%
  select(replicate, lower, upper, confidence_level)

# Combine into single data frame

percentile_cis_by_level <- bind_rows(perc_cis_80, perc_cis_95, perc_cis_99)

Observe that as the confidence level increases from 80% to 95% to 99%, the confidence intervals tend to get wider as seen in Table @ref(tab:perc-cis-average-width) where we compare their average widths.

Average width of 80, 95, and 99% confidence intervals
Confidence level Mean width
80% 0.172
95% 0.268
99% 0.348

So in order to have a higher confidence level, our confidence intervals must be wider. Ideally, we would have both a high confidence level and narrow confidence intervals. However, we cannot have it both ways. If we want to be more confident, we need to allow for wider intervals. Conversely, if we would like a narrow interval, we must tolerate a lower confidence level.

The moral of the story is: Higher confidence levels tend to produce wider confidence intervals. When looking at Table @ref(tab:perc-cis-average-width) it is important to keep in mind that we kept the sample size fixed at \(n = 50\). Thus, all \(10 \cdot 3 = 30\) random samples from the urn had the same sample size. What happens if instead we took samples of different sizes? Recall that we did this in Subsection @ref(different-shovels) using virtual shovels with 25, 50, and 100 slots.

Fitting multiple models using map()

Note that we have used the same code three times. That means we should write a function! Let’s adapt the code from the last section to take 10 samples of an arbitrary size and then construct confidence intervals of an arbitrary level:

urn_confidence <- function(bowl, n = 50, level = 0.95) {
  
  # Get lower and upper probabilities
  
  lower_prob = (1 - level) / 2
  upper_prob = level + lower_prob
  
  # Get 10 random samples
  
  bowl %>%
    rep_sample_n(size = n, reps = 10) %>%
    
      # For each one, construct a CI
    
    group_by(replicate) %>%
    nest() %>%
    mutate(boots = map(data, ~ rep_sample_n(., size = n, replace = TRUE, reps = 1000)),
           prop_red_results = map(boots, ~ group_by(., replicate) %>%
                                    summarize(prop_red = mean(color == "red"))),
           prop_red = map(prop_red_results, ~ pull(., prop_red)),
           lower = map_dbl(prop_red, ~ quantile(., probs = lower_prob)),
           upper = map_dbl(prop_red, ~ quantile(., probs = upper_prob)),
           sample_size = n,
           confidence_level = level) %>%
    select(replicate, lower, upper, sample_size, confidence_level)
}

Note that we saved the sample size as sample_size and the confidence level as confidence_level for convenient display later.

So we now could create our objects like so:

perc_cis_80 <- urn_confidence(bowl = urn, n = 50, level = 0.80)
perc_cis_95 <- urn_confidence(bowl = urn, n = 50, level = 0.95)
perc_cis_99 <- urn_confidence(bowl = urn, n = 50, level = 0.99)

# Combine into single data frame

percentile_cis_by_level <- bind_rows(perc_cis_80, perc_cis_95, perc_cis_99)
percentile_cis_by_level
## # A tibble: 30 x 5
## # Groups:   replicate [10]
##    replicate lower upper sample_size confidence_level
##        <int> <dbl> <dbl>       <dbl>            <dbl>
##  1         1  0.36  0.54          50              0.8
##  2         2  0.26  0.42          50              0.8
##  3         3  0.34  0.52          50              0.8
##  4         4  0.28  0.46          50              0.8
##  5         5  0.36  0.52          50              0.8
##  6         6  0.3   0.48          50              0.8
##  7         7  0.28  0.44          50              0.8
##  8         8  0.28  0.46          50              0.8
##  9         9  0.3   0.46          50              0.8
## 10        10  0.28  0.44          50              0.8
## # … with 20 more rows

But this still isn’t the best way. Note that we have three objects we need to deal with. Furthermore, we don’t actually know that the names are accurate! Perhaps we made a mistake and created perc_cis_99 with the code perc_cis_99 <- urn_confidence(urn, 50, 0.95); our object name will now be misleading as to what’s actually in the object.

Instead, let’s store our results in a single tibble by using map() to create a list column.

tibble(level = c(0.80, 0.95, 0.99)) %>%
  mutate(data = map(level, ~ urn_confidence(bowl = urn, n = 50, level = .)))
## # A tibble: 3 x 2
##   level data             
##   <dbl> <list>           
## 1  0.8  <tibble [10 × 5]>
## 2  0.95 <tibble [10 × 5]>
## 3  0.99 <tibble [10 × 5]>

Now that we have a list column, we want to unnest() it so we can see our actual data. Further, we won’t need the redundant column level any more:

tibble(level = c(0.80, 0.95, 0.99)) %>%
  mutate(data = map(level, ~ urn_confidence(bowl = urn, n = 50, level = .))) %>%
  unnest(data) %>%
  select(-level)
## # A tibble: 30 x 5
##    replicate lower upper sample_size confidence_level
##        <int> <dbl> <dbl>       <dbl>            <dbl>
##  1         1  0.26  0.42          50              0.8
##  2         2  0.26  0.44          50              0.8
##  3         3  0.3   0.48          50              0.8
##  4         4  0.12  0.26          50              0.8
##  5         5  0.28  0.44          50              0.8
##  6         6  0.3   0.48          50              0.8
##  7         7  0.28  0.44          50              0.8
##  8         8  0.14  0.3           50              0.8
##  9         9  0.38  0.56          50              0.8
## 10        10  0.28  0.44          50              0.8
## # … with 20 more rows

Now that we can create an object with any confidence level we please, why not calculate the widths for confidence levels from 0.50 to 0.99 and plot the results?

# Same code as before, but for more levels

tibble(level = seq(0.50, 0.99, by = 0.01)) %>%
  mutate(data = map(level, ~ urn_confidence(bowl = urn, n = 50, level = .))) %>%
  unnest(data) %>%
  select(-level) %>%
  
  # Calculate mean widths
  
  mutate(width = upper - lower) %>%
  group_by(confidence_level) %>%
  summarize(mean_width = mean(width)) %>%
  
  # Plot results
  
  ggplot(aes(x = confidence_level, y = mean_width)) +
  geom_point() +
  labs(x = "Confidence level",
       y = "Mean width",
       title = "Change in mean width of CI as level increase")

As the confidence level gets larger, the width of the interval gets larger too.

Impact of sample size

This time, let’s fix the confidence level at 95%, but consider three different sample sizes for \(n\): 25, 50, and 100. Specifically, we’ll first take 10 different random samples of size 25, 10 different random samples of size 50, and 10 different random samples of size 100. We’ll then construct 95% percentile-based confidence intervals for each sample. Finally, we’ll compare the widths of these intervals.

Note that now that we have a function, urn_confidence(), we can do this very easily!

set.seed(9)


perc_cis_n_25  <- urn_confidence(bowl = urn, n = 25, level = 0.95)
perc_cis_n_50  <- urn_confidence(bowl = urn, n = 50, level = 0.95)
perc_cis_n_100 <- urn_confidence(bowl = urn, n = 100, level = 0.95)

# Combine into single data frame

percentile_cis_by_n <- bind_rows(perc_cis_n_25, perc_cis_n_50, perc_cis_n_100)

Let’s compare the average widths in Table @ref(tab:perc-cis-average-width-2). Observe that as the confidence intervals are constructed from larger and larger sample sizes, they tend to get narrower.

Average width of 95% confidence intervals based on \(n = 25\), \(50\), and \(100\)
Sample size Mean width
25 0.364
50 0.274
100 0.186

The moral of the story is: Larger sample sizes tend to produce narrower confidence intervals. Recall that this was a key message in Subsection @ref(moral-of-the-story). As we used larger and larger shovels for our samples, the sample proportions red, \(\widehat{p}\), tended to vary less. In other words, our estimates got more and more precise.

Recall that we visualized these results in Figure @ref(fig:comparing-sampling-distributions-3), where we compared the sampling distributions for \(\widehat{p}\) based on samples of size \(n\) equal to 25, 50, and 100. We also quantified the sampling variation of these sampling distributions using their standard deviation, which has that special name: the standard error. So as the sample size increases, the standard error decreases.

Let’s do this again, but creating everything in a single tibble so we can look at sample sizes from 25 to 200:

# Same code as before, but for more sample sizes

tibble(n = seq(25, 200, by = 5)) %>%
  mutate(data = map(n, ~ urn_confidence(bowl = urn, n = ., level = 0.95))) %>%
  unnest(data) %>%
  select(-n) %>%
  
  # Calculate mean widths
  
  mutate(width = upper - lower) %>%
  group_by(sample_size) %>%
  summarize(mean_width = mean(width)) %>%
  
  # Plot results
  
  ggplot(aes(x = sample_size, y = mean_width)) +
  geom_point() +
  labs(x = "Sample size",
       y = "Mean width",
       title = "Change in mean width of CI as sample size decreases")

As the sample size gets larger, the width of the interval gets smaller.

Using lm() and tidy() as a shortcut

Recall the confidence interval we constructed for the mean year of the pennies using the percentile method:

virtual_resampled_means %>%
  pull(mean_year) %>%
  quantile(c(0.025, 0.975))
##     2.5%    97.5% 
## 1991.200 1999.741

Is there a way to construct that interval without engaging in the bootstrap resampling? That is, can we approximate this by using the pennies_sample tibble directly?

It turns out that we can, using the lm() function. lm() stands for linear model, and we will be using it in the next two chapters. This function can get the mean of a variable using the following syntax:

lm(year ~ 1, data = pennies_sample)
## 
## Call:
## lm(formula = year ~ 1, data = pennies_sample)
## 
## Coefficients:
## (Intercept)  
##        1995

Once we learn about regression, we’ll learn why this is labeled the “intercept.” The key thing to note right now is that lm() takes two main arguments, formula and data. The notation year ~ 1 will get the mean of year.

The broom package makes it easier to work with lm() model objects. In particular, the function tidy() has the option conf.int = TRUE, which we can use to get a confidence interval:

library(broom)

lm(year ~ 1, data = pennies_sample) %>%
  tidy(conf.int = TRUE)
## # A tibble: 1 x 7
##   term        estimate std.error statistic   p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
## 1 (Intercept)    1995.      2.15      930. 1.03e-105    1991.     2000.

This presents a lot of columns that we don’t care about, so we will select() the relevant ones:

library(broom)

lm(year ~ 1, data = pennies_sample) %>%
  tidy(conf.int = TRUE) %>%
  select(estimate, conf.low, conf.high)
## # A tibble: 1 x 3
##   estimate conf.low conf.high
##      <dbl>    <dbl>     <dbl>
## 1    1995.    1991.     2000.

These are not identical to our bootstrap confidence interval estimates, but they are close enough. Thus, if you are interested in constructing a confidence interval for a mean, you can use lm() and tidy() as a shortcut. tidy() also has an argument, conf.level, which allows us to specify the confidence level. By default, it is 0.95.

Case study: Is yawning contagious?

Let’s apply our knowledge of confidence intervals to answer the question: “Is yawning contagious?” If you see someone else yawn, are you more likely to yawn? In an episode of the US show Mythbusters, the hosts conducted an experiment to answer this question. The episode is available to view in the United States on the Discovery Network website here and more information about the episode is also available on IMDb.

Mythbusters study data

Fifty adult participants who thought they were being considered for an appearance on the show were interviewed by a show recruiter. In the interview, the recruiter either yawned or did not. Participants then sat by themselves in a large van and were asked to wait. While in the van, the Mythbusters team watched the participants using a hidden camera to see if they yawned. The data frame containing the results of their experiment is available in the mythbusters_yawn data frame included in the moderndive package:

mythbusters_yawn
## # A tibble: 50 x 3
##     subj group   yawn 
##    <int> <chr>   <chr>
##  1     1 seed    yes  
##  2     2 control yes  
##  3     3 seed    no   
##  4     4 seed    yes  
##  5     5 seed    no   
##  6     6 control no   
##  7     7 seed    yes  
##  8     8 control no   
##  9     9 control no   
## 10    10 seed    no   
## # … with 40 more rows

The variables are:

  • subj: The participant ID with values 1 through 50.
  • group: A binary treatment variable indicating whether the participant was exposed to yawning. "seed" indicates the participant was exposed to yawning while "control" indicates the participant was not.
  • yawn: A binary response variable indicating whether the participant ultimately yawned.

Recall that you learned about treatment and response variables in Appendix @ref(rubin-causal-model).

Let’s use some data wrangling to obtain counts of the four possible outcomes:

mythbusters_yawn %>% 
  group_by(group, yawn) %>% 
  summarize(count = n())
## # A tibble: 4 x 3
## # Groups:   group [2]
##   group   yawn  count
##   <chr>   <chr> <int>
## 1 control no       12
## 2 control yes       4
## 3 seed    no       24
## 4 seed    yes      10

Let’s first focus on the "control" group participants who were not exposed to yawning. 12 such participants did not yawn, while 4 such participants did. So out of the 16 people who were not exposed to yawning, 4/16 = 0.25 = 25% did yawn.

Let’s now focus on the "seed" group participants who were exposed to yawning. Of these, 24 participants did not yawn, while 10 participants did yawn. So out of the 34 people who were exposed to yawning, 10/34 = 0.294 = 29.4% did yawn. Comparing these two percentages, the participants who were exposed to yawning yawned 29.4% - 25% = 4.4% more often than those who were not.

Sampling scenario

Let’s review the terminology and notation related to sampling we studied in Subsection @ref(terminology-and-notation). In Chapter @ref(sampling) our study population was the urn of \(N = 2400\) balls. Our population parameter of interest was the population proportion of these balls that were red, denoted mathematically by \(p\). In order to estimate \(p\), we extracted a sample of 50 balls using the shovel and computed the relevant point estimate: the sample proportion that were red, denoted mathematically by \(\widehat{p}\).

Who is the study population here? All humans? All the people who watch the show Mythbusters? It’s hard to say! This question can only be answered if we know how the show’s hosts recruited participants! In other words, what was the sampling methodology used by the Mythbusters to recruit participants? We alas are not provided with this information. Only for the purposes of this case study, however, we’ll assume that the 50 participants are a representative sample of all Americans given the popularity of this show. Thus, we’ll be assuming that any results of this experiment will generalize to all \(N = 327\) million Americans (2018 population).

Just like with our sampling urn, the population parameter here will involve proportions. However, in this case it will be the difference in population proportions \(p_{seed} - p_{control}\), where \(p_{seed}\) is the proportion of all Americans who if exposed to yawning will yawn themselves, and \(p_{control}\) is the proportion of all Americans who, if not exposed to yawning, still yawn themselves. Correspondingly, the point estimate/sample statistic based the Mythbusters’ sample of participants will be the difference in sample proportions \(\widehat{p}_{seed} - \widehat{p}_{control}\). Let’s extend our table of scenarios of sampling for inference to include our latest scenario.

Scenarios of sampling for inference
Scenario Population parameter Notation Point estimate Symbol(s)
1 Population proportion \(p\) Sample proportion \(\widehat{p}\)
2 Population mean \(\mu\) Sample mean \(\overline{x}\) or \(\widehat{\mu}\)
3 Difference in population proportions \(p_1 - p_2\) Difference in sample proportions \(\widehat{p}_1 - \widehat{p}_2\)

This is known as a two-sample inference situation since we have two separate samples. Based on their two-samples of size \(n_{seed}\) = 34 and \(n_{control}\) = 16, the point estimate is

\[ \widehat{p}_{seed} - \widehat{p}_{control} = \frac{24}{34} - \frac{12}{16} = 0.04411765 \approx 4.4\% \]

However, what if the Mythbusters repeated this experiment? Assume that they recruited 50 new participants and exposed 34 of them to yawning and 16 not. Would they obtain the exact same estimated difference of 4.4%? Probably not, again, because of sampling variation.

How does this sampling variation affect their estimate of 4.4%? In other words, what would be a plausible range of values for this difference that accounts for this sampling variation? We can answer this question with confidence intervals! We will construct a 95% confidence interval for \(p_{seed} - p_{control}\) using bootstrap resampling with replacement.

We make a couple of important notes. First, for the comparison between the "seed" and "control" groups to make sense, however, both groups need to be independent from each other. Otherwise, they could influence each other’s results. This means that a participant being selected for the "seed" or "control" group has no influence on another participant being assigned to one of the two groups. As an example, if there were a mother and her child as participants in the study, they wouldn’t necessarily be in the same group. They would each be assigned randomly to one of the two groups of the explanatory variable.

Second, the order of the subtraction in the difference doesn’t matter so long as you are consistent and tailor your interpretations accordingly. In other words, using a point estimate of \(\widehat{p}_{seed} - \widehat{p}_{control}\) or \(\widehat{p}_{control} - \widehat{p}_{seed}\) does not make a material difference, you just need to stay consistent and interpret your results accordingly.

Constructing the confidence interval

Let’s first construct the bootstrap distribution for \(\widehat{p}_{seed} - \widehat{p}_{control}\) and then use this to construct 95% confidence intervals for \(p_{seed} - p_{control}\).

Our first step is to perform bootstrap resampling with replacement like we did with the slips of paper in our pennies activity in Section @ref(resampling-tactile). However, instead of calculating a mean, we are now calculating a difference in means. As we’ll see, the basic process remains the same, which is one of the reasons the bootstrap is such a powerful tool.

We start by using rep_sample_n() to get 1,000 replicates, or, in other words, we bootstrap resample the 50 participants with replacement 1,000 times.

mythbusters_yawn %>%
  rep_sample_n(size = 50, reps = 1000, replace = TRUE)
## # A tibble: 50,000 x 4
## # Groups:   replicate [1,000]
##    replicate  subj group   yawn 
##        <int> <int> <chr>   <chr>
##  1         1    35 seed    yes  
##  2         1    13 control no   
##  3         1    17 seed    no   
##  4         1    25 control yes  
##  5         1     4 seed    yes  
##  6         1    38 seed    no   
##  7         1    41 seed    no   
##  8         1    41 seed    no   
##  9         1    44 seed    no   
## 10         1    18 seed    no   
## # … with 49,990 more rows

Observe that the resulting data frame has 50,000 rows. This is because we performed resampling of 50 participants with replacement 1,000 times and 50,000 = 1,000 \(\cdot\) 50. The variable replicate indicates which resample each row belongs to. So it has the value 1 50 times, the value 2 50 times, all the way through to the value 1000 50 times.

After we generate many replicates of bootstrap resampling with replacement, we next want to summarize the bootstrap resamples of size 50 with a single summary statistic, the difference in proportions. All this requires is taking the proportion of “yes” outcomes for both groups (control and seed) and then subtracting those proportions from one another. To do this, we will group_by() both replicate and group:

mythbusters_yawn %>%
  rep_sample_n(size = 50, reps = 1000, replace = TRUE) %>%
  group_by(replicate, group) %>%
  summarize(prop_yawn = mean(yawn == "yes"))
## # A tibble: 2,000 x 3
## # Groups:   replicate [1,000]
##    replicate group   prop_yawn
##        <int> <chr>       <dbl>
##  1         1 control    0.0588
##  2         1 seed       0.182 
##  3         2 control    0.3   
##  4         2 seed       0.275 
##  5         3 control    0.316 
##  6         3 seed       0.258 
##  7         4 control    0.389 
##  8         4 seed       0.312 
##  9         5 control    0.385 
## 10         5 seed       0.162 
## # … with 1,990 more rows

We now have two rows per replicate, the first being control and the second being seed. That means we can construct our difference in proportions by grouping by replicate and subtracting one prop_yawn from the other:

mythbusters_yawn %>%
  rep_sample_n(size = 50, reps = 1000, replace = TRUE) %>%
  group_by(replicate, group) %>%
  summarize(prop_yawn = mean(yawn == "yes")) %>%
  pivot_wider(names_from = group, values_from = prop_yawn) %>% 
  group_by(replicate) %>%
  summarize(diff_prop_yawn = seed - control)
## # A tibble: 1,000 x 2
##    replicate diff_prop_yawn
##        <int>          <dbl>
##  1         1        0.0347 
##  2         2       -0.0146 
##  3         3        0.149  
##  4         4       -0.0492 
##  5         5        0.132  
##  6         6        0.0921 
##  7         7       -0.00794
##  8         8       -0.0190 
##  9         9        0.0153 
## 10        10        0.0746 
## # … with 990 more rows

There are two important coding tricks. First, we use group_by() twice, once to perform a calculation for each replicate and group, and once to perform a calculation just for each replicate. Failure to keep track of exactly what is currently being group_by()’d is a common cause of bugs. Second, pivot_wider() helps us to bring values into the same row, this making later calculations (like subtraction) easier.

Observe that the resulting data frame has 1,000 rows and 2 columns corresponding to the 1,000 replicate ID’s and the 1,000 differences in proportions, one for each bootstrap resample, in diff_prop_yawn.

Next, let’s compute the 95% confidence interval for \(p_{seed} - p_{control}\) using the percentile method, in other words, by identifying the 2.5th and 97.5th percentiles which include the middle 95% of values. Recall that this method does not require the bootstrap distribution to be normally shaped.

mythbusters_yawn %>%
  rep_sample_n(size = 50, reps = 1000, replace = TRUE) %>%
  group_by(replicate, group) %>%
  summarize(prop_yawn = mean(yawn == "yes")) %>%
  pivot_wider(names_from = group, values_from = prop_yawn) %>% 
  group_by(replicate) %>%
  summarize(diff_prop_yawn = seed - control) %>% 
  pull(diff_prop_yawn) %>%
  quantile(c(0.025, 0.975))
##       2.5%      97.5% 
## -0.2022059  0.3140117

There is one value of particular interest that this 95% confidence interval contains: zero. If \(p_{seed} - p_{control}\) were equal to 0, then there would be no difference in proportion yawning between the two groups. This would suggest that there is no associated effect of being exposed to a yawning recruiter on whether you yawn yourself.

In our case, since the 95% confidence interval includes 0, we cannot conclusively say if either proportion is larger. Of our 1,000 bootstrap resamples with replacement, sometimes \(\widehat{p}_{seed}\) was higher and thus those exposed to yawning yawned themselves more often. At other times, the reverse happened.

Say, on the other hand, the 95% confidence interval was entirely above zero. This would suggest that \(p_{seed} - p_{control} > 0\), or, in other words \(p_{seed} > p_{control}\), and thus we’d have evidence suggesting those exposed to yawning do yawn more often.

Furthermore, if the 50 participants were randomly allocated to the "seed" and "control" groups, then this would be suggestive that being exposed to yawning doesn’t not cause yawning. In other words, yawning is not contagious. However, no information on how participants were assigned to be exposed to yawning or not could be found, so we cannot make such a causal statement.

Using lm() and tidy() as a shortcut

We have learned that you can use lm() as a shortcut to construct confidence interval for a mean. You can also use it to construct a confidence interval for a difference in means. Recall that you’ll need the broom package loaded in order to use the tidy() function. Let’s start by considering the lm() syntax:

mythbusters_yawn %>% 
  mutate(yawn_numeric = ifelse(yawn == "yes", 1, 0)) %>% 
  lm(formula = yawn_numeric ~ group)
## 
## Call:
## lm(formula = yawn_numeric ~ group, data = .)
## 
## Coefficients:
## (Intercept)    groupseed  
##     0.25000      0.04412

There are three differences from what we’ve seen before. First, since yawn is a character variable, we had to create a new variable, yawn_numeric, which would take on the value 1 if there was a yawn and zero otherwise. Working with lm() requires numbers. Second, since we are interested in the difference in means by group, we used group instead of 1 on the right side of the ~. Third, because we are using lm() within a pipe, we need to explicitly declare our formula. If, instead, we had used the data argument, this would not have been necessary.

Next, let’s tidy() this object:

mythbusters_yawn %>% 
  mutate(yawn_numeric = ifelse(yawn == "yes", 1, 0)) %>% 
  lm(formula = yawn_numeric ~ group) %>%
  tidy(conf.int = TRUE) %>%
  select(term, estimate, conf.low, conf.high)
## # A tibble: 2 x 4
##   term        estimate conf.low conf.high
##   <chr>          <dbl>    <dbl>     <dbl>
## 1 (Intercept)   0.25     0.0199     0.480
## 2 groupseed     0.0441  -0.235      0.323

We are also selecting the term column, since now lm() gives estimates for two terms: (Intercept) and groupseed. We will hold off on interpreting lm() model objects until the next chapter, but, for now, all you need to know is that groupseed is the term that represents our difference in means. Let’s filter() to that term:

mythbusters_yawn %>% 
  mutate(yawn_numeric = ifelse(yawn == "yes", 1, 0)) %>% 
  lm(formula = yawn_numeric ~ group) %>%
  tidy(conf.int = TRUE) %>%
  select(term, estimate, conf.low, conf.high) %>% 
  filter(term == "groupseed")
## # A tibble: 1 x 4
##   term      estimate conf.low conf.high
##   <chr>        <dbl>    <dbl>     <dbl>
## 1 groupseed   0.0441   -0.235     0.323

Again, the confidence interval using lm() and tidy() is similar to what we saw using the bootstrap method, but lm() often runs much faster.

Conclusion

Comparing bootstrap and sampling distributions

Let’s talk more about the relationship between sampling distributions and bootstrap distributions.

Recall back in Subsection @ref(shovel-1000-times), we took 1,000 virtual samples from the urn using a virtual shovel, computed 1,000 values of the sample proportion red \(\widehat{p}\), then visualized their distribution in a histogram. Recall that this distribution is called the sampling distribution of \(\widehat{p}\) . Furthermore, the standard deviation of the sampling distribution has a special name: the standard error.

We also mentioned that this sampling activity does not reflect how sampling is done in real life. Rather, it was an idealized version of sampling so that we could study the effects of sampling variation on estimates, like the proportion of the shovel’s balls that are red. In real life, however, one would take a single sample that’s as large as possible, much like in the Obama poll we saw in Section @ref(sampling-case-study). But how can we get a sense of the effect of sampling variation on estimates if we only have one sample and thus only one estimate? Don’t we need many samples and hence many estimates?

The workaround to having a single sample was to perform bootstrap resampling with replacement from the single sample. We did this in the resampling activity in Section @ref(resampling-tactile) where we focused on the mean year of minting of pennies. We used pieces of paper representing the original sample of 50 pennies from the bank and resampled them with replacement from a hat. We had 35 of our friends perform this activity and visualized the resulting 35 sample means \(\overline{x}\) in a histogram in Figure @ref(fig:tactile-resampling-6).

This distribution was called the bootstrap distribution of \(\overline{x}\). We stated at the time that the bootstrap distribution is an approximation to the sampling distribution of \(\overline{x}\) in the sense that both distributions will have a similar shape and similar spread. Thus the standard error of the bootstrap distribution can be used as an approximation to the standard error of the sampling distribution.

Let’s show you that this is the case by now comparing these two types of distributions. Specifically, we’ll compare

  1. the sampling distribution of \(\widehat{p}\) based on 1,000 virtual samples from the urn from Subsection @ref(shovel-1000-times) to
  2. the bootstrap distribution of \(\widehat{p}\) based on 1,000 virtual resamples with replacement from Ilyas and Yohan’s single sample urn_sample_1.

Sampling distribution

Here is the code you saw in Subsection @ref(shovel-1000-times) to construct the sampling distribution of \(\widehat{p}\) shown again in Figure @ref(fig:sampling-distribution-part-deux), with some changes to incorporate the statistical terminology relating to sampling from Subsection @ref(terminology-and-notation).

# Take 1000 virtual samples of size 50 from the urn.

# DK: Need to do a better job of ensuring that this is the same as the one
# "seen" before.

set.seed(76)

virtual_samples <- urn %>% 
  rep_sample_n(size = 50, reps = 1000)

# Compute the sampling distribution of 1000 values of p-hat

sampling_distribution <- virtual_samples %>% 
  group_by(replicate) %>% 
  summarize(red = sum(color == "red")) %>% 
  mutate(prop_red = red / 50)

# Visualize sampling distribution of p-hat

sampling_distribution %>%
  ggplot(aes(x = prop_red)) +
  geom_histogram(binwidth = 0.05, boundary = 0.4, color = "white") +
  labs(x = "Proportion of 50 balls that were red", 
       title = "Sampling distribution")
Previously seen sampling distribution of sample proportion red for $n = 1000$.

Previously seen sampling distribution of sample proportion red for \(n = 1000\).

An important thing to keep in mind is the default value for replace is FALSE when using rep_sample_n(). This is because when sampling 50 balls with a shovel, we are extracting 50 balls one-by-one without replacing them. This is in contrast to bootstrap resampling with replacement, where we resample a ball and put it back, and repeat this process 50 times.

Let’s quantify the variability in this sampling distribution by calculating the standard deviation of the prop_red variable representing 1,000 values of the sample proportion \(\widehat{p}\). Remember that the standard deviation of the sampling distribution is the standard error, frequently denoted as se.

sampling_distribution %>% summarize(se = sd(prop_red))
## # A tibble: 1 x 1
##       se
##    <dbl>
## 1 0.0667

Bootstrap distribution

Here is the code to construct the bootstrap distribution of \(\widehat{p}\) based on Ilyas and Yohan’s original sample of 50 balls saved in urn_sample_1.

set.seed(14)
bootstrap_distribution <- urn_sample_1 %>% 
  rep_sample_n(size = 50, reps = 1000, replace = TRUE) %>%
  summarize(prop_red = mean(color == "red"))
bootstrap_distribution %>%
  ggplot(aes(x = prop_red)) +
  geom_histogram(binwidth = 0.05, boundary = 0.4, color = "white") +
  labs(x = "Proportion of 50 balls that were red",
       title = "Bootstrap distribution")
Bootstrap distribution of proportion red for $n = 1000$.

Bootstrap distribution of proportion red for \(n = 1000\).

bootstrap_distribution %>% summarize(se = sd(prop_red))
## # A tibble: 1 x 1
##       se
##    <dbl>
## 1 0.0675

Comparison

Now that we have computed both the sampling distribution and the bootstrap distributions, let’s compare them side-by-side in Figure @ref(fig:side-by-side). We’ll make both histograms have matching scales on the x- and y-axes to make them more comparable. Furthermore, we’ll add:

  1. To the sampling distribution on the top: a solid line denoting the proportion of the urn’s balls that are red \(p\) = 0.375.
  2. To the bootstrap distribution on the bottom: a dashed line at the sample proportion \(\widehat{p}\) = 21/50 = 0.42 = 42% that Ilyas and Yohan observed.
Comparing the sampling and bootstrap distributions of $\widehat{p}$.

Comparing the sampling and bootstrap distributions of \(\widehat{p}\).

There is a lot going on in Figure @ref(fig:side-by-side), so let’s break down all the comparisons slowly. First, observe how the sampling distribution on top is centered at \(p\) = 0.375. This is because the sampling is done at random and in an unbiased fashion. So the estimates \(\widehat{p}\) are centered at the true value of \(p\).

However, this is not the case with the following bootstrap distribution. The bootstrap distribution is centered at 0.42, which is the proportion red of Ilyas and Yohan’s 50 sampled balls. This is because we are resampling from the same sample over and over again. Since the bootstrap distribution is centered at the original sample’s proportion, it doesn’t necessarily provide a better estimate of \(p\) = 0.375. This leads us to our first lesson about bootstrapping:

The bootstrap distribution will likely not have the same center as the sampling distribution. In other words, bootstrapping cannot improve the quality of an estimate.

Second, let’s now compare the spread of the two distributions: they are somewhat similar. In the previous code, we computed the standard deviations of both distributions as well. Recall that such standard deviations have a special name: standard errors. Let’s compare them in Table @ref(tab:comparing-se).

Comparing standard errors
Distribution type Standard error
Sampling distribution 0.067
Bootstrap distribution 0.067

Notice that the bootstrap distribution’s standard error is a rather good approximation to the sampling distribution’s standard error. This leads us to our second lesson about bootstrapping:

Even if the bootstrap distribution might not have the same center as the sampling distribution, it will likely have very similar shape and spread. In other words, bootstrapping will give you a good estimate of the standard error.

Thus, using the fact that the bootstrap distribution and sampling distributions have similar spreads, we can build confidence intervals using bootstrapping as we’ve done all throughout this chapter!

Rubin Causal Model

Conclusion